Divergence and Shannon information in genomes 
Shannon information (SI) and its special case, divergence, are defined for a DNA sequence in
terms of probabilities of chemical words in the sequence and are computed for a set of complete
genomes highly diverse in length and composition. We find the following: SI (but not divergence)
is inversely proportional to sequence length for a random sequence but is length-independent for
genomes; the genomic SI is always greater and, for shorter words and longer sequences, hundreds
to thousands of times greater than the SI in a random sequence whose length and composition match
those of the genome; genomic SIs appear to have word-length dependent universal values. The
universality is inferred to be an evolution footprint of a universal mode for genome growth.

PACS numbers: 87.10.+e, 89.70.+c, 87.14.Gg, 87.23.Kg, 02.50.-r



Shannon entropy [1] has been used in almost every field concerned
with information, including the analysis of DNA and protein
sequences [2, 3, 4]. In the field of comparative genomics, however,
it seems not to have found any systematic application. The high
heterogeneity of complete genomes in length, base composition and
percentage of coding regions may make comparison based on Shannon
entropy problematic. Here we show that by a simple and appropriate
definition of a quantity we call Shannon information (SI), the
difficulties associated with these issues are surmounted. The SI is
applied to characterize the distribution of occurrence frequency (FD)
of $k$-mers, or words of $k$ chemical letters, in complete genomes.
We present a simple relation between the SI and the relative spectral
width of an FD, a relation that furnishes an intuitive understanding
of the connection between SI and information in a sequence. We show
that in spite of their high heterogeneity, the SIs in complete
genomes can be represented by a set of genome-independent universal
lengths.

Divergence and Shannon information. Consider a set of occurrence
frequencies, $\mathcal{F} = \{f_i \mid \sum_{i=1}^{\tau} f_i = L\} \equiv
\{f_i \mid \tau, L\}$, for $\tau$ types of events, where $f_i/L$ is the
probability of event $i$. The Shannon entropy, or uncertainty [1], for
$\mathcal{F}$ is $H(\mathcal{F}) = -\sum_i (f_i/L)\ln(f_i/L)$. The quantity
attains its maximum value $H_{max} = \ln\tau$ when all $f_i$ are equal to
the mean frequency $\bar{f} = \tau^{-1}L$. In $H$, Shannon was concerned
with the fidelity of messages as they are transported through
communication devices. Here we are interested in the information in
itself. There is a general notion that information in a system
increases with a decrease in uncertainty, hence we identify the
quantity,

$$D(\mathcal{F}) \equiv \ln\tau - H(\mathcal{F}) = L^{-1}\sum_i f_i \ln(f_i/\bar{f}) \qquad (1)$$

called divergence by Gatlin, as the zeroth order SI in $\mathcal{F}$.
Shannon called the ratio $D(\mathcal{F})/H_{max}$ redundancy.

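Suppose the set $\mathcal{F} = \{\mathcal{F}_m \mid m = 1, 2, \cdots\}$ is
composed of subsets, $\mathcal{F}_m = \{f_i \mid \tau_m, L_m\}$, each having
its own distinct types of events $\tau_m$, total number of events $L_m$
and mean frequency $\bar{f}_m = L_m/\tau_m$, where $\sum_m \tau_m = \tau$
and $\sum_m L_m = L$. The divergence of each subset is
$D(\mathcal{F}_m) = \sum_i' (f_i/L_m)\ln(f_i/\bar{f}_m)$, where the summation
is restricted to those $f_i$'s in the subset $\mathcal{F}_m$. We define the
SI carried by $\mathcal{F}$ to be the weighted average of the divergences
in the subsets:

$$R(\mathcal{F}) = \sum_m (L_m/L)\, D(\mathcal{F}_m) = L^{-1}\sum_i f_i \ln(f_i/\bar{f}_m) \qquad (2)$$

where, in the last sum, $\bar{f}_m$ is the mean frequency of the subset
to which $f_i$ belongs.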
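As a minimal sketch of Eqs. (1) and (2), the following Python functions compute the entropy, the divergence, and the SI from plain lists of occurrence frequencies. The function names and the list-of-lists representation of the subsets are illustrative assumptions, not part of the original work; logarithms are natural, so results are in nats.

```python
import math

def entropy(freqs):
    """Shannon entropy H = -sum (f_i/L) ln(f_i/L); zero counts contribute 0."""
    total = sum(freqs)
    return -sum((f / total) * math.log(f / total) for f in freqs if f > 0)

def divergence(freqs):
    """Eq. (1): D = ln(tau) - H, with tau the number of event types
    (event types with zero count still count towards tau)."""
    return math.log(len(freqs)) - entropy(freqs)

def shannon_information(subsets):
    """Eq. (2): average of the subset divergences, weighted by L_m / L."""
    total = sum(sum(s) for s in subsets)
    return sum(sum(s) / total * divergence(s) for s in subsets)

# Two subsets with very different mean frequencies but no internal spread:
# the divergence of the pooled set is large, while the SI vanishes.
pooled = [10, 10, 1, 1]
print(divergence(pooled))
print(shannon_information([[10, 10], [1, 1]]))
```

The example at the end illustrates why the two quantities can differ: the divergence of the pooled set responds to the spread between subset means, whereas the SI only responds to the spread within each subset.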



Frequency distribution in a DNA sequence. We view a single strand of
DNA as a linear text written in the four chemical letters, A, C, G and
T, representing the four kinds of nucleotides. Empirically genomes are
invariably within a few percent of being compositionally
self-complementary and, for the present study, it suffices to
characterize the base composition of a genome by a single number, $p$,
the combined probability of (A+T). From now on the term profile of a
sequence will refer to the $p$ value and the length $L$ of the sequence.
We will use the SI in random sequences as benchmarks for the SI in
genomes. A random sequence having the profile of a genome is said to be
a random match for the genome.

For a DNA sequence of length $L$ we denote by $\mathcal{F}^{\{k\}}$ the set
$\{f_i \mid \tau = 4^k, L\}_k$, where $f_i$ is the occurrence frequency of
the $i$-th overlapping $k$-letter word, or $k$-mer. For
$\mathcal{F}^{\{k\}}$ the number of event types is $\tau = 4^k$ and the mean
frequency is $\bar{f} = 4^{-k}L$. Given $\mathcal{F}^{\{k\}}$ we can construct
a $k$-spectrum where $n_f$, the number of $k$-mers occurring with
frequency $f$, is given as a function of $f$. For simplicity we also
call $\mathcal{F}^{\{k\}}$ a $k$-spectrum.
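The construction of $\mathcal{F}^{\{k\}}$ and of the $k$-spectrum can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the input is assumed to contain only the letters A, C, G and T, and the total count of overlapping words is $L-k+1$, which is treated as $L$ in the text because $k \ll L$.

```python
from collections import Counter
from itertools import product

def kmer_frequencies(seq, k):
    """Occurrence frequency of every one of the 4**k overlapping k-mers,
    including words that never occur (frequency 0)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ("".join(w) for w in product("ACGT", repeat=k))
    return {w: counts.get(w, 0) for w in words}

def k_spectrum(freqs):
    """n_f: the number of k-mers that occur with frequency f."""
    return Counter(freqs.values())

freqs = kmer_frequencies("ACGTACGT", 2)
spectrum = k_spectrum(freqs)
print(freqs["AC"], spectrum[2], spectrum[0])
```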

The $\mathcal{F}^{\{k\}}$ of a genome is always composed of $k+1$ subsets
$\mathcal{F}^{\{k\}}_m$, called $m$-sets, $m = 0$ to $k$, where
$\mathcal{F}^{\{k\}}_m$ is the subset of $k$-mers with $m$ (A+T)'s. When a
genome does not have $p \approx 0.5$, and most genomes are of this type,
then the mean frequencies $\bar{f}_m$ of the subsets
$\mathcal{F}^{\{k\}}_m$ are distinct:
$\bar{f}_m(p) = \bar{f}\, 2^k p^m (1-p)^{k-m}$. Fig. 1 shows the 5-spectra
of a $p = 0.691$ genome, Clostridium acetobutylicum, and its random
match. The spectrum from the random match (sharp peaks in gray (or
green)) is composed of five nonoverlapping subspectra (a sixth one is
out of range), one for each $\mathcal{F}^{\{k\}}_m$. The corresponding
subspectra (dash and dotted gray curves) underlying the entire genomic
spectrum (black curve) are also shown; they are seen to have
overlapping widths.
FIG. 1: The 5-spectra of the genome of C. acetobutylicum (black) ($p \approx 0.7$)
and that of its random match (green or gray), normalized to that of a 2 Mb
sequence. The random-match spectrum is composed of sharply peaking
subspectra $m = 0$ to 4. The center of the $m = 5$ subspectrum is off scale at
10,500. Ordinates are not integers because (for better viewing only) large
fluctuations in the spectra are smoothed out by averaging over a range of 21
frequencies.

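The types and number of $k$-mers in $\mathcal{F}^{\{k\}}_m$ are
$\tau_m = 2^k \binom{k}{m}$ and $L_m = \tau_m \bar{f}_m$, respectively, where
$\binom{k}{m}$ is the binomial coefficient. As in Eq. (2) we define the SI
in $\mathcal{F}^{\{k\}}$ as the weighted mean of the divergences in the
$\mathcal{F}^{\{k\}}_m$:
$R(\mathcal{F}^{\{k\}}) = L^{-1}\sum_i f_i \ln(f_i/\bar{f}_m)$. When $p$
approaches 0.5 all $\bar{f}_m$ tend to $\bar{f}$ and
$R(\mathcal{F}^{\{k\}})$ approaches $D(\mathcal{F}^{\{k\}})$. Note that
$R(\mathcal{F}^{\{k\}})$ and $D(\mathcal{F}^{\{k\}})$ are invariant under the
replacement $p \leftrightarrow 1-p$.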
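The bookkeeping of the $m$-sets can be sketched as below, under the assumption that $k$-mer frequencies are held in a dictionary keyed by word. `m_set_parameters` tabulates $\tau_m$, $\bar{f}_m(p)$ and $L_m$ for a random match of a given profile, and `shannon_information` evaluates $R(\mathcal{F}^{\{k\}})$ with $\bar{f}_m$ taken as the observed mean of each $m$-set. All names are illustrative, not taken from the original work.

```python
import math

def m_set_parameters(k, p, L):
    """Rows (m, tau_m, f_m, L_m) for the m-sets of a random match
    with (A+T) probability p and length L."""
    f_bar = L / 4**k
    rows = []
    for m in range(k + 1):
        tau_m = 2**k * math.comb(k, m)                  # word types with m (A+T)'s
        f_m = f_bar * 2**k * p**m * (1 - p)**(k - m)    # mean frequency in the m-set
        rows.append((m, tau_m, f_m, tau_m * f_m))
    return rows

def at_count(word):
    """m: the number of A or T letters in a word."""
    return sum(1 for c in word if c in "AT")

def shannon_information(freqs):
    """R = (1/L) sum_i f_i ln(f_i / mean of the m-set containing word i)."""
    groups = {}
    for word, f in freqs.items():
        groups.setdefault(at_count(word), []).append(f)
    total = sum(freqs.values())
    r = 0.0
    for fs in groups.values():
        mean = sum(fs) / len(fs)
        r += sum(f * math.log(f / mean) for f in fs if f > 0)
    return r / total

rows = m_set_parameters(6, 0.597, 1.07e6)
for m, tau_m, f_m, L_m in rows:
    print(m, tau_m, round(f_m, 1), round(L_m / 1e3, 1))
```

The $\tau_m$ sum to $4^k$ and the $L_m$ sum to $L$, as required by the definitions above.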

Relative spectral width. In Eq. (1), by expanding $f_i$ around $\bar{f}$
in the logarithm we obtain for a unimodal $\mathcal{F}$ the series

$$D(\mathcal{F}) = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n(n+1)}\, \frac{1}{\tau}\sum_i \left(\frac{f_i - \bar{f}}{\bar{f}}\right)^{n+1} \approx \frac{\Delta^2}{2\bar{f}^2} \qquad (3)$$

where $\Delta$ is the standard deviation in $\mathcal{F}$, or
$D(\mathcal{F}) \approx \sigma^2/2$, where $\sigma = \Delta/\bar{f}$ is the
relative spectral width (RSW) of $\mathcal{F}$. This relation is useful in
at least two respects. First, it gives one a heuristic understanding of
why the divergence of and information in a (unimodal) spectrum are
connected. When the spectrum is narrow, there is little information
because all the $k$-mers occur with almost equal frequency. Because of
the generally substantiated notion that highly overrepresented words in
a genome are more likely to have biological meaning than words
represented with an average frequency [10, 11], the obverse is also
true, namely, there is more information when the spectrum is broad.
Second, Eq. (3) gives an accurate estimate of the divergence in the
$\mathcal{F}^{\{k\}}_m$ (and $\mathcal{F}^{\{k\}}$) of a random sequence.
Provided $\bar{f}_m$ is much greater than one, the spectrum of
$\mathcal{F}^{\{k\}}_m$ is proportional to a Poisson distribution with mean
$\bar{f}_m$ and variance $\Delta_m^2 \approx b_k \bar{f}_m$, so that
$\sigma_m^2 = \Delta_m^2 \bar{f}_m^{-2} \approx b_k/\bar{f}_m$.



TABLE I: Shannon entropy $H$ and divergence $D$, in units of ln 2, in the
$k$-spectra of the genome sequence of E. coli O157:H7 and its random
match.

        Random match                        E. coli O157:H7
  k     H        D         $b_k 4^k/2L$     H        D         $\sigma^2/2$
  2     4.0000   5.12E-7   1.06E-6          3.986    1.39E-2   1.39E-2
  3     6.0000   6.76E-6   6.26E-6          5.953    4.71E-2   4.41E-2
  4     8.0000   3.03E-5   2.92E-5          7.908    9.16E-2   8.65E-2
  5     9.9999   1.28E-4   1.25E-4          9.854    1.46E-1   1.42E-1
  6     11.999   5.23E-4   5.18E-4          11.79    2.07E-1   2.11E-1
  7     13.998   2.12E-3   2.10E-3          13.72    2.73E-1   2.90E-1
  8     15.991   8.56E-3   8.48E-3          15.65    3.49E-1   3.87E-1
  9     17.965   3.45E-2   3.42E-2          17.54    4.54E-1   5.22E-1
  10    19.858   1.42E-1   1.37E-1          19.34    6.58E-1   7.70E-1



Hence $D(\mathcal{F}^{\{k\}}_m) \approx b_k/2\bar{f}_m = b_k\tau_m/2L_m$ for a
random sequence. Then from Eq. (2) we have a result for the SI in the
$\mathcal{F}^{\{k\}}$ of any random match, independent of $p$:

$$R(\mathcal{F}^{\{k\}}) \approx b_k\tau/2L \quad \text{(random sequence)} \qquad (4)$$

The combinatorial factor $b_k$ should be $1-\tau^{-1}$ for a true random
sequence; a semi-empirical value $1 - 1/2^{k-1}$ is used here because a
random match is not fully random: it is made to be approximately
compositionally self-complementary (as most genomes are) and its $p$
value is fixed. That the SI in a random sequence diminishes as $1/L$
with increasing $L$ is connected to what is known as the large-system
rule: the square of the RSW of a distribution associated with a random
system is inversely proportional to the size of the system. We will see
that the SI in genomes does not follow this rule.
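Eq. (4) with the semi-empirical $b_k = 1 - 1/2^{k-1}$ is simple enough to evaluate directly. The sketch below (function names are illustrative) computes the expected SI of a random match and expresses it in units of ln 2 for a 5.53 Mb sequence, the length quoted in the text for E. coli O157:H7, so that the output can be compared with the fourth column of Table I.

```python
import math

def b_factor(k):
    """Semi-empirical combinatorial factor b_k = 1 - 1/2**(k-1)."""
    return 1.0 - 1.0 / 2 ** (k - 1)

def expected_random_si(k, L):
    """Eq. (4): R ~ b_k * tau / (2 L), with tau = 4**k, in nats."""
    return b_factor(k) * 4 ** k / (2.0 * L)

L_ecoli = 5.53e6  # sequence length in bases, as quoted in the text
for k in range(2, 11):
    print(k, f"{expected_random_si(k, L_ecoli) / math.log(2):.2E}")
```

Doubling $L$ halves every value, which is the $1/L$ behavior described above.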

Shannon information in a $p \approx 0.5$ genome. The genome of E. coli
O157:H7 is 5.53 Mb long and has $p = 0.496$. We therefore use Eq. (1) to
compute its SI. The Shannon entropy and divergence (in units of ln 2)
in the $k$-spectra, $k = 2$ to 10, of the genome and its random match are
given in Table I. We notice the following: (i) For both sequences the
Shannon entropy (columns 2 and 5) is in every case close to its maximum
value, $2k$. (ii) For the random match the value computed using Eq. (1)
(column 3) is in excellent agreement with the expected value from
Eq. (4) (column 4). (iii) For the smallest $k$'s, the divergence is a
minuscule signal buried in a huge Shannon entropy background. (iv) This
tiny signal, when isolated, cleanly sets apart a genome from its random
match: the genomic divergence is greater than its random-match
counterpart in all cases and, for the smaller $k$'s, many thousand times
greater. (v) Eq. (3) is verified (last column).
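The leading-order relation $D \approx \sigma^2/2$ of Eq. (3) can be checked numerically on any narrow, unimodal set of frequencies. The sketch below uses a small made-up frequency list purely for illustration (it is not genomic data), and the function names are assumptions.

```python
import math

def divergence(freqs):
    """Eq. (1) in the form D = (1/L) sum_i f_i ln(f_i / mean)."""
    total = sum(freqs)
    mean = total / len(freqs)
    return sum(f * math.log(f / mean) for f in freqs if f > 0) / total

def half_rsw_squared(freqs):
    """sigma^2 / 2, with sigma = (standard deviation) / (mean frequency)."""
    mean = sum(freqs) / len(freqs)
    variance = sum((f - mean) ** 2 for f in freqs) / len(freqs)
    return variance / (2.0 * mean ** 2)

narrow = [95, 105, 100, 100, 98, 102]   # hypothetical narrow spectrum
d = divergence(narrow)
s = half_rsw_squared(narrow)
print(d, s)
```

For broad spectra the higher-order terms of the series matter, which is consistent with the growing gap between $D$ and $\sigma^2/2$ at large $k$ in the last two columns of Table I.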

Difference between divergence and SI. When $p$ is not close to 0.5,
$D(\mathcal{F}^{\{k\}})$ and $R(\mathcal{F}^{\{k\}})$ differ. Consider the
5-spectra (not the subspectra) for the genome and its random match
shown in Fig. 1. Because the $\bar{f}_m$'s are spread widely, their
distribution plays a dominant role in determining the width of the
spectra. If we ignore the widths of the individual subspectra and use
Eq. (3) we get
$D(\mathcal{F}^{\{k\}}) \approx D^{(0)}(k,p) = 0.5[2^k(p^2 + (1-p)^2)^k - 1]$
for both the genomic and random spectra ($D^{(0)}(5, 0.691) = 0.488$, as
compared to the actual values of 0.575 and 0.485 for the genomic and
random 5-spectra in Fig. 1). $D(\mathcal{F}^{\{k\}})$ depends strongly on
$p$, is independent of $L$, disagrees with Eq. (4), and is therefore not
useful for estimating the SI of a random sequence.

The monomer divergence and SI are quantities that are completely
determined by the base composition of a sequence and are independent
of its randomness. Whereas $R(\mathcal{F}^{\{1\}}) = 0$ always,
$D(\mathcal{F}^{\{1\}})$ depends strongly on $p$ and vanishes only at
$p = 0.5$. The genome of Mycoplasma genitalium has $p = 0.683$ and
$D(\mathcal{F}^{\{1\}}) = 0.0686$ (compared with $D^{(0)} = 0.0670$). The
genome of Haemophilus influenzae has $p = 0.618$ and
$D(\mathcal{F}^{\{1\}}) = 0.0281$ (0.0278). Since these values say nothing
about the randomness or order of the sequence yet depend nontrivially
on $p$, they convey less information about the sequence than the value
of $p$ alone.
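The numbers quoted in the two paragraphs above follow from $p$ alone and can be reproduced with a few lines. The sketch assumes exact compositional self-complementarity, that is, monomer probabilities $p/2$ for A and T and $(1-p)/2$ for C and G; the function names are illustrative.

```python
import math

def monomer_divergence(p):
    """D for k = 1 with P(A) = P(T) = p/2 and P(C) = P(G) = (1-p)/2, in nats.
    Equals ln 4 - H = ln 2 + p ln p + (1-p) ln(1-p); requires 0 < p < 1."""
    return math.log(2) + p * math.log(p) + (1 - p) * math.log(1 - p)

def d0(k, p):
    """D^(0)(k, p) = 0.5 * [2**k * (p**2 + (1-p)**2)**k - 1]."""
    return 0.5 * (2 ** k * (p * p + (1 - p) ** 2) ** k - 1)

print(round(d0(5, 0.691), 3))              # 5-spectrum of Fig. 1
print(round(monomer_divergence(0.683), 4), round(d0(1, 0.683), 4))
print(round(monomer_divergence(0.618), 4), round(d0(1, 0.618), 4))
```

Both functions are symmetric under $p \leftrightarrow 1-p$ and vanish at $p = 0.5$, in line with the statements in the text.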



TABLE II: Divergence (Eq. (1)) in the $m$-sets of the 6-spectra of three
genomes and their random matches.

  m    $\bar{f}_m$   $L_m$ (kb)   $b_6/(2\bar{f}_m)$   $D_{ran}$   $D_{gen}$

  Pyrobaculum aerophilum, p=0.486, L=2.22 Mb
  0    480       30.7      1.01E-3    8.54E-4    4.32E-2
  1    500       192       9.69E-4    1.06E-3    6.88E-2
  2    520       499       9.31E-4    9.58E-4    8.71E-2
  3    541       692       8.96E-4    8.12E-4    1.33E-1
  4    563       540       8.93E-4    8.80E-4    1.80E-1
  5    586       225       8.60E-4    9.37E-4    1.58E-1
  6    610       39.0      7.94E-4    7.65E-4    7.74E-2

  Chlamydia muridarum, p=0.597, L=1.07 Mb
  0    68.7      4.40      7.05E-3    4.87E-3    1.11E-1
  1    103       39.5      4.70E-3    4.47E-3    1.16E-1
  2    155       148       3.12E-3    3.18E-3    1.15E-1
  3    232       297       2.08E-3    2.15E-3    1.07E-1
  4    348       334       1.39E-3    1.49E-3    1.02E-1
  5    521       200       9.29E-4    1.12E-3    1.20E-1
  6    782       50.0      6.14E-4    7.36E-4    1.52E-1

  Streptomyces avermitilis, p=0.293, L=9.03 Mb
  0    89.0      5.70      5.44E-3    5.99E-3    1.73E-1
  1    214       82.2      2.26E-3    2.27E-3    1.80E-1
  2    519       498       9.33E-4    1.04E-3    4.21E-1
  3    1252      1602      3.87E-4    3.78E-4    2.59E-1
  4    3024      2903      1.60E-4    1.66E-4    1.33E-1
  5    7300      2803      6.63E-5    6.14E-5    6.22E-2
  6    17622     1127      2.75E-5    2.87E-5    8.87E-2




The general case. Table II gives the divergences, $D_{gen}$ and
$D_{ran}$, of the $m$-sets in the 6-spectra of three sequences and their
random matches: P. aerophilum, C. muridarum, and S. avermitilis. These
sequences span a wide range in profile. Once again, $D_{ran}$ (column 5)
is well approximated by its expected value (column 4) and is much less
than $D_{gen}$. $D_{gen}$ varies but does not exhibit a clear dependence
on $L$. The 6-mer SI, computed according to Eq. (2), for E. coli O157:H7
and the three genomes of Table II and their random matches are given in
Table III. Again we have: $R_{ran}$ is well approximated by
$b_k\tau/2L$, $R_{gen} \gg R_{ran}$, and $R_{gen}$ is independent of
genome length.

A new feature is manifest in Table III: in spite of their greatly
varying profiles the four genomes have almost the same SI in their
6-spectra. Indeed, at least for word lengths up to 10 letters, we have
found that SI varies little for complete genomes with vastly differing
profiles. Fig. 2 shows SI as a function of sequence length in six
prokaryotes and six eukaryotic chromosomes. In the plot, the symbols in
each column, top to bottom, give the $k$-mer SI, $k$=10 (□), 9 (▷), ...,
to 2 (○), of an organism. The $k$-mer SI of a sequence is ill-defined
when the sequence length is less than twice $4^k$, therefore the SIs for
$k$=9 and 10 in E. cuniculi and for $k$=10 in M. genitalium are not
given. We have verified that the SI in different chromosomes of the
same organism vary little. For comparison the solid and dashed lines in
Fig. 2 connect the expected 2-mer SI (×100) and 10-mer SI in the
matching random sequences.

TABLE III: Shannon information (Eq. (2)) in the 6-spectra of the four
genomes cited in Tables I and II and their random matches.

  Genome     p       L (Mb)   $\bar{f}$   $R_{ran}$   $R_{ran}/(b_6/2\bar{f})$   $R_{gen}$
  E. coli    0.486   5.52     1348    3.59E-4    1.01     0.143
  P. aero.   0.514   2.22     542     8.97E-4    0.995    0.127
  C. muri.   0.597   1.07     261     1.83E-3    1.05     0.108
  S. aver.   0.293   9.03     2203    2.20E-4    1.03     0.144

[Figure 2: plot of SI against sequence length (b).]

FIG. 2: SI in the $k$-spectra for $k$=2 (○), 3 (◁), 4 (∗), 5 (▽), 6 (+), 7 (△), 8
(×), 9 (▷), and 10 (□), of the genomes of twelve organisms: E cun, E. cuniculi
(Chromosome I; length 0.198 Mb, p=0.533); CM, C. muridarum; MG, M.
genitalium (length 0.580 Mb, p=0.683); SC, yeast (S. cerevisiae, IV; 1.53
Mb, 0.621); MJ, M. jannaschii (1.66 Mb, 0.686); PA, P. aerophilum; E col,
E. coli K12 (4.64 Mb, 0.492); SA, S. avermitilis; CE, worm (C. elegans,
Chr. I; 15.1 Mb, 0.643); DM, fly (D. melanogaster, X; 21.8 Mb, 0.574);
MM, mouse (M. musculus, IX; 98.9 Mb, 0.562); HS, human (H. sapiens,
I; 228 Mb, 0.582). The lines give SI expected in random sequences for $k$=10
(open circles connected by dotted line) and 100 times the expected value for
$k$=2 (solid circles connected by solid line).

Genomes have universal SI. With the exception of several 2-mer SIs,
all genomic SIs lie within a horizontal band bounded by 0.01 and 0.9.
(For technical reasons too lengthy to describe here but which will be
explained elsewhere, the 2-mer SI has the largest fluctuations.) For a
given $k$ the genomic variation in SI is only about a factor of two. In
comparison the longest sequence, H. sapiens (I) at 228 Mb, is about
1,200 times longer than the shortest, E. cuniculi I at 0.198 Mb. In
this context we view the genomic SI as having a genome-independent but
$k$-dependent universal value. For each $k$ we express this universal
value, $R_{uni}(k)$, in terms of the length, $L_r$, called the equivalent
root-sequence length, of a random sequence whose SI is equal to
$R_{uni}$: $L_r(k) = 4^k b_k/2R_{uni}(k)$. Then the genomic results in
Fig. 2 are summarized by the empirical relation
$\ln(L_r(k)) = ak + C$, where $a = 1.01 \pm 0.06$, $C = 3.80 \pm 0.50$.
$L_r$ is surprisingly short for small $k$ but grows rapidly with $k$:
$L_r(k) \approx$ 340 b, 15 kb and 1.1 Mb, for $k$=2, 6, and 10,
respectively.
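The two expressions for the equivalent root-sequence length can be evaluated as follows. This is a sketch with illustrative function names; the value $R = 0.13$ used for $k = 6$ is an assumption, taken as a rough average of the four $R_{gen}$ entries in Table III rather than a number stated in the text.

```python
import math

def b_factor(k):
    """Semi-empirical combinatorial factor b_k = 1 - 1/2**(k-1)."""
    return 1.0 - 1.0 / 2 ** (k - 1)

def root_length_from_si(k, r_uni):
    """L_r(k) = 4**k * b_k / (2 * R_uni(k)), in bases."""
    return 4 ** k * b_factor(k) / (2.0 * r_uni)

def root_length_from_fit(k, a=1.01, c=3.80):
    """Empirical relation ln L_r(k) = a*k + C with the central fit values."""
    return math.exp(a * k + c)

print(round(root_length_from_si(6, 0.13)))   # assumed R_uni(6) ~ 0.13
for k in (2, 6, 10):
    print(k, round(root_length_from_fit(k)))
```

With the central values of $a$ and $C$ the fit gives roughly 340 b at $k$=2 and 1.1 Mb at $k$=10; at $k$=6 it gives about 19 kb, which agrees with the 15 kb quoted above only within the stated uncertainties of the fit.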

Now, if a random root-sequence of length $L_r$ is replicated a finite
number of times, then, provided $k \ll L_r$, the $k$-mer SI in the
product sequence is the same as that in the root-sequence. We therefore
suggest the observed universality of SI in complete genomes is an
evolution footprint produced by a universal mode of genome growth, a
mode that began when the genomes were very short - less than 340 b -
and in which segmental duplication played a major role. Work exploring
the implication of this notion will be reported elsewhere.
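The invariance of the $k$-mer SI under replication can be illustrated with a toy experiment. The sketch below makes several assumptions that are not in the original: a uniformly random root-sequence ($p = 0.5$, so that SI and divergence coincide), simple tandem replication, a root length of 5000 b and four copies. The only difference between the two spectra comes from the few words that straddle the junctions between copies, which is why the condition $k \ll L_r$ matters.

```python
import math
import random
from collections import Counter

def kmer_divergence(seq, k):
    """Divergence, Eq. (1), of the overlapping k-mer frequencies of seq."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    mean = total / 4 ** k          # tau = 4**k, absent words contribute zero
    return sum(f * math.log(f / mean) for f in counts.values()) / total

rng = random.Random(1)             # fixed seed so the run is repeatable
root = "".join(rng.choice("ACGT") for _ in range(5000))
product_seq = root * 4             # tandem replication of the root-sequence

d_root = kmer_divergence(root, 2)
d_product = kmer_divergence(product_seq, 2)
print(d_root, d_product)
```

A random sequence of the full product length would instead have a divergence about four times smaller, following the $1/L$ rule of Eq. (4).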

This work is partly supported by grants 92-2112-M-008-040 and
93-2311-B-008-006 from the National Science Council (ROC) to HCL.